fix: null-guard responseObject derefs across NSO ActivityPolicies (DLQ leak)#230
fix: null-guard responseObject derefs across NSO ActivityPolicies (DLQ leak)#230ecv wants to merge 3 commits into
Conversation
The create rule's summary dereferenced audit.responseObject.metadata.name, but a rejected create (e.g. 409/422) returns a Status object with no metadata.name. The rule still matched (it only checks requestObject.spec), so CEL evaluation raised "no such key: name", the event went to the DLQ, and retries failed identically -- a slow, steady DLQ leak (DLQSlowLeak in prod). Guard the leaf path and fall back to objectRef.name, preserving the server-assigned-name display for successful creates while staying null-safe for rejected ones. Fixes the prod DLQSlowLeak on policy gateway.networking.k8s.io-gateway (error_type=cel_summary).
…olicies
Same DLQ-leak class as the gateway create rule: create/update summaries
dereference audit.responseObject.{metadata.name,spec.domainName}, but a
rejected create/update returns a Status object with no spec/metadata.name, so
CEL raises "no such key" and the event leaks to the DLQ.
Guard the leaf and fall back to audit.objectRef.name (always present) across:
- backendtlspolicy, connector, connectoradvertisement, trafficprotection (create)
- domain (create + update)
delete rules already guard has(audit.responseObject.spec) and are unchanged.
|
Interesting, does the log capture what the full audit log payload is? I'm curious why the metadata name isn't available in the response object of the create operation. |
|
Successful creates have it, only failures don't. Quoth the raven,
But |
|
Yes — the processor logs the full audit event. {
"auditID": "5c93f394-87b7-4108-912a-920facc12774",
"level": "RequestResponse",
"verb": "create",
"objectRef": {
"apiGroup": "networking.datumapis.com",
"apiVersion": "v1alpha1",
"namespace": "default",
"resource": "connectors"
},
"requestObject": {
"kind": "Connector",
"metadata": { "generateName": "datum-connect-" },
"spec": { "connectorClassName": "datum-connect" }
},
"responseObject": {
"apiVersion": "v1",
"kind": "Status",
"code": 403,
"status": "Failure",
"reason": "Forbidden",
"message": "connectors.networking.datumapis.com \"datum-connect-2qjq6\" is forbidden: Your request took too long to be checked against your quota. Please try again in a moment — if this keeps happening, contact support.",
"metadata": {},
"details": { "group": "networking.datumapis.com", "kind": "connectors", "name": "datum-connect-2qjq6" }
},
"responseStatus": { "code": 403, "status": "Failure", "reason": "Forbidden", "details": { "name": "datum-connect-2qjq6" } }
}This confirms the root cause concretely: the create was rejected (403, quota check timed out), so But this payload also shows the current fix is incomplete for "objectRef": { "apiGroup": "...", "apiVersion": "v1alpha1", "namespace": "default", "resource": "connectors" }The client used — takes the else branch and derefs The name is present in the rejected event — at Gateway likely uses explicit names (so |
… creates The earlier fix guarded responseObject.<leaf> and fell back to audit.objectRef.name, but a live audit event from staging exposed two gaps: 1. generateName creates (connector, connectoradvertisement) rejected before a name is assigned have an EMPTY audit.objectRef.name. The fallback then derefs audit.objectRef.name on a Status responseObject and raises "no such key: name" again, so the event still dead-letters. The name is carried on the Status at responseObject.details.name. 2. domain create/update used has(audit.responseObject.spec.domainName), but nested has(a.b.c) evaluates the intermediate selection and errors when a rejected Status has no spec. This is the same DLQ bug class for every rejected domain create/update. Replace each create/update summary with a per-level-guarded fallback chain: responseObject.<leaf> -> objectRef.name -> responseObject.details.name -> literal The details branch is guarded with has(responseObject.details) because some Status responses carry no details. domain also guards has(responseObject.spec). Evidence: activity-processor logs (processor.go:945) on staging show networking.datumapis.com-connector events re-failing the DLQ retry (retryCount climbing) on the previously "fixed" objectRef.name branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#672) ## What Make the `create` rule summaries null-safe in six ActivityPolicies that dereference `audit.responseObject.metadata.name`. Guard the leaf and fall back to `audit.objectRef.name`. ## Why Same DLQ-leak class as the prod `DLQSlowLeak` on the NSO gateway policy (milo-os/activity#212). A `create` rule matches on verb/requestObject, so a **rejected** create (409/422/admission-deny) still matches — but the apiserver returns a `Status` object as `responseObject` with no `metadata.name`. CEL raises `no such key: name`, the event goes to the DLQ, and retries fail identically. Latent until the first rejected create of each kind. ## Fixed (create rules) - `config/services/activity/policies/resourcemanager/` — project, organization - `config/services/activity/policies/iam/` — role, group, serviceaccount - `config/services/identity/policies/` — serviceaccount ## Remediation Merge + milo release → the `milo-activity-policies` Flux Kustomization re-applies the corrected CRs; processor retries any DLQ backlog. No `kubectl`. ## Related - NSO fix (gateway + 5 more): datum-cloud/network-services-operator#230 - billing fix: milo-os/billing (separate PR) - Runbook fix: milo-os/activity#213 · Investigation: milo-os/activity#212
What
Make
auditRules[].summaryCEL null-safe across all NSO ActivityPolicies thatdereference
audit.responseObject.<leaf>in a create/update rule. Guard the leafand fall back to
audit.objectRef.name.Why
Prod
DLQSlowLeakongateway.networking.k8s.io-gateway(
error_type=cel_summary, ~120 Gateway audit events/6h, climbing):The
create/updaterules match onhas(audit.requestObject.spec), so arejected request (409/422/admission-deny) still matches — but the apiserver
returns a
Statusobject asresponseObject, with nospec/metadata.name.CEL raises
no such key, the event goes to the DLQ, and retries failidentically. Slow steady leak.
An audit of every NSO ActivityPolicy found the same latent bug in 5 more files.
Fixed rules
create(the prod leak)create(responseObject.metadata.name)create+update(responseObject.spec.domainName)deleterules already guardhas(audit.responseObject.spec)— unchanged.httproute/httpproxydelete useresponseObject.spec.hostnames[0]but matchguards
has(...hostnames) && size(...) > 0— safe, unchanged.Prod remediation
Merge + NSO release → Flux re-applies the corrected ActivityPolicy CRs
(overwriting the live objects); processor immediately retries the DLQ backlog.
No
kubectl.Related (same bug class, separate repos)